Abstract: Cutting-plane methods are well-studied localization
(and optimization) algorithms. We show that, beyond their intended
optimization use, they provide a natural framework to perform
machine learning, and not just to solve optimization problems posed
by machine learning. In particular, they allow one to learn sparse
classifiers and provide good compression schemes. Moreover, we show
that very little effort is required to turn them into effective
active learning methods. This last property provides a generic way
to design a whole family of active learning algorithms from existing
passive methods. We present numerical simulations that support the
relevance of cutting-plane methods for passive and active learning
tasks.

I. Introduction 

We show that localization methods based on cutting planes
provide a natural framework to derive machine learning
algorithms for classification, both in the supervised learning
framework and in the active learning framework. Our claim
is that cutting-plane algorithms, beyond their optimization
purposes, embed features that are beneficial for generalization.
In particular, a) under mild conditions, they provide
compression schemes whose compression rate is directly related
to their aim of rapidly finding a solution of the localization
problem, and b) the pivotal step of such algorithms, namely the
querying step, may be slightly modified so as to be
active-learning friendly.

In the present paper, we show that existing learning algorithms
can be revisited from the cutting-plane point of view. Not only
can the active learning SVM procedure of Tong and Koller [1] be
reinterpreted as an algorithm falling under the framework we
describe, but so can the Bayes Point Machines Q, for which we
propose an active learning version.

The problems we are interested in are linear classification
problems. Given a training sample $D = \{(x_n, y_n)\}_{n \in [N]}$, with
$x_n \in \mathcal{X} = \mathbb{R}^d$, $y_n \in \mathcal{Y} = \{-1, +1\}$, and $[N] = \{1, \ldots, N\}$,
we are looking for a classification vector $w \in \mathcal{X}$ that is an
element of the version space

$\mathcal{W}_0(D) = \{ w \in \mathcal{X} : y_n \langle w, x_n \rangle > 0, \; n \in [N] \}$, (1)

of $D$, i.e. the set of vectors $w$ from $\mathcal{X}$ such that the
corresponding linear predictors

$f_w(x) = \mathrm{sign}(\langle w, x \rangle)$ (2)

make no mistake on the training set $D$. To keep the exposition
clear, we assume that the training data are linearly separable,
so that $\mathcal{W}_0(D)$ is not empty. The case where $\mathcal{W}_0(D) = \emptyset$
can be tackled with usual machine learning techniques, e.g. the
"$\lambda$-trick" and/or kernels Q Q. Also, for the sake of brevity, we
may write $\mathcal{W}_0$ instead of $\mathcal{W}_0(D)$ and thus drop the explicit
dependence on $D$.
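As a concrete illustration of definitions (1) and (2), the following minimal Python sketch checks whether a candidate vector belongs to the version space of a small sample. The helper names (`dot`, `predict`, `in_version_space`) and the toy data are illustrative assumptions, not part of the original paper.

```python
def dot(u, v):
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))

def predict(w, x):
    """Linear predictor f_w(x) = sign(<w, x>), with sign(0) taken as -1 here."""
    return 1 if dot(w, x) > 0 else -1

def in_version_space(w, sample):
    """True iff y_n <w, x_n> > 0 for every labelled pair (x_n, y_n)."""
    return all(y * dot(w, x) > 0 for x, y in sample)

# Toy sample: pairs (x_n, y_n) with y_n in {-1, +1}.
D = [((1.0, 2.0), +1), ((-1.0, -0.5), -1), ((2.0, 0.5), +1)]

print(in_version_space((1.0, 1.0), D))   # w = (1, 1) classifies every point correctly
print(in_version_space((-1.0, 0.0), D))  # w = (-1, 0) misclassifies the first point
```

Membership is unchanged when `w` is multiplied by any positive constant, which is why the search is later restricted to the unit ball.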

With the relevant notation at hand, the problem we are
interested in may be stated as:

find $w \in \mathcal{W}_0$, (3)

which may simply be rewritten as the problem of solving a
set of linear inequalities:

find $w$ s.t. $y_n \langle w, x_n \rangle > 0, \; n \in [N]$. (4)

A variety of methods dating back to the 1950s are available in
the optimization literature to solve such problems. Among them,
we may mention (over-)relaxation based methods g, g,
simplex-based algorithms and, of course, the Perceptron algorithm
and its numerous variants 0-0. Localization methods based on
cutting planes, or, in short, cutting-plane algorithms, are
well studied and known to be very efficient at solving such
problems. We will show that, when used to solve the problem
above, i) they naturally provide compression scheme algorithms
|0, and thus learning algorithms that embed features designed to
ensure good generalization properties, and ii) they also set the
ground for the development of new active learning algorithms.

A. Related Works 

Cutting-plane methods provide a family of optimization
procedures that have received some interest from the machine
learning community ||T0|- GD-. However, they have mainly been
considered as optimization methods to solve problems such as
those posed by support vector machines or, more generally,
regularized risk functionals. The deeper connection of these
methods with learning algorithms, that is, procedures designed
to ensure the generalization ability of the predictor they build
(e.g. the Perceptron algorithm), has been studied less;
discussing this feature is one of the distinctive contributions
of the present paper. To some extent, the work of HD, which
pinpoints how statistical regularization is beneficial for the
stabilization of cutting-plane methods, touches on this
connection. Within the vast literature on active learning (see,
e.g. G])), we may single out a few contributions to which our
work is closely related; they share the common feature of
exploiting the geometry of the version space. The query
strategies proposed by G3 and p3| are based on multiple
estimations of the volume of the (potential) version space,
which, taken together, might be computationally expensive. In
comparison, in the active learning strategy we derive from the
general cutting-plane approach, we compute our queries from
an approximate center of gravity of the version space, which
is computationally equivalent to a single volume estimation.
The work of [16], which proposes a margin-based query strategy,
provides theoretical justifications for such strategies and
gives insights into the foundations the work of [1] hinges on.
Our contribution is to show how the cutting-plane literature and
its accompanying worst-case convergence analyses may give rise
to theoretically supported query strategies that do not have to
hinge on margin-based arguments. To some extent, our work
has connections with uncertainty-based active learning (see,
e.g. [17]), which advocates querying the points whose class
is the most uncertain; our approach may be re-interpreted as
a theoretically motivated uncertainty measure based on the
volume reduction of the version space.

B. Outline 


Algorithm 1 Classical Cutting Plane Algorithm for the
localization of $w \in C$.

Ensure: $w \in C$
  Compute $C^0$ such that $C \subseteq C^0$ and $C^0$ is convex and closed
  $t \leftarrow 0$
  repeat
    Compute query point $w^t$ in $C^t$
    Ask the cutting plane oracle whether $w^t \in C$
    if $w^t \notin C$ then
      Receive a cutting plane $(a_t, b_t)$
      $C^{t+1} \leftarrow C^t \cap \{x : \langle a_t, x \rangle \geq b_t\}$
      $t \leftarrow t + 1$
    end if
  until $w^t \in C$
  return $w^t$


The paper is structured as follows. Section II provides some
background on cutting-plane methods and their possible
application to learning. Section III further explores the
connections between cutting planes and learning algorithms and
then provides a way to turn cutting-plane methods into active
learning algorithms. Section IV reports empirical results for
algorithms derived from our arguments on the relevance
of cutting-plane methods to machine learning.


II. Background 

In this section, we first recall the general form of a
cutting-plane algorithm to solve a localization problem. We then
specialize this algorithm to the case where the convex set
in which we want to find a point is the version space
associated with the training set $D$. Finally, we give the reader
a taste of how cutting-plane algorithms give rise to learning
algorithms, i.e. algorithms that embed features beneficial for
generalization: namely, they define compression schemes with a
targeted small compression size.


A. Vanilla Localization Algorithm with Cutting Planes 
In order to solve a problem like

find $w \in C$,

for $C$ some closed convex set, a localization algorithm based
on cutting planes works as follows (see also the synthetic
depiction in Algorithm 1). The algorithm maintains and
iteratively refines (i.e. reduces) a closed convex set $C^t$ that is
known to contain $C$. From $C^t$, a query point $w^t$ is computed
(there are several ways to compute such query points; we will
mention some later on, when specializing localization methods to
the specific problem of finding a point in the version space),
which leads to two possible outcomes: either a) $w^t$ is in
$C$ and the problem is solved, or b) $w^t \notin C$. In the
latter case, a so-called cutting plane oracle is queried with
$w^t$, upon which it returns the parameters $(a_t, b_t)$ of a hyperplane
$\{z : \langle a_t, z \rangle = b_t\}$ that separates $w^t$ from
$C$, i.e., $\forall w \in C$, $\langle a_t, w \rangle \geq b_t$ and $\langle a_t, w^t \rangle < b_t$. The hyperplane
is used to reduce $C^t$ into $C^{t+1} = C^t \cap \{w : \langle a_t, w \rangle \geq b_t\}$ (which still
contains $C$). For the specific problem of finding a point in
the version space, the cutting planes returned by the oracle
will be such that $b_t = 0$.
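The loop described above can be sketched in one dimension, where the enclosing set is an interval and the query point is its midpoint. This is only an illustrative toy under stated assumptions: the target set $C$ is a known interval, the oracle (`make_interval_oracle`, a hypothetical helper) returns a cut placed halfway between the query and $C$, and none of these names come from the paper.

```python
def make_interval_oracle(c_lo, c_hi):
    """Cutting-plane oracle for the target set C = [c_lo, c_hi] on the real line.

    For a query w outside C it returns (a, b) such that a*z >= b for all z in C
    and a*w < b; for a query inside C it returns None.
    """
    def oracle(w):
        if w < c_lo:   # C lies to the right of w: keep {z : z >= b}
            return (1.0, 0.5 * (w + c_lo))
        if w > c_hi:   # C lies to the left of w: keep {z : -z >= b}
            return (-1.0, -0.5 * (w + c_hi))
        return None
    return oracle

def localize(lo, hi, oracle, max_iter=200):
    """Cutting-plane localization with C^t = [lo, hi] and midpoint queries."""
    for t in range(max_iter):
        w = 0.5 * (lo + hi)          # query point: the 'center' of C^t
        cut = oracle(w)
        if cut is None:              # w is in C: done
            return w, t
        a, b = cut
        if a > 0:                    # C^{t+1} = C^t intersected with {z : z >= b/a}
            lo = max(lo, b / a)
        else:                        # C^{t+1} = C^t intersected with {z : z <= b/a}
            hi = min(hi, b / a)
    raise RuntimeError("no point found within max_iter iterations")

w, n_cuts = localize(0.0, 1.0, make_interval_oracle(0.30, 0.31))
print(w, n_cuts)
```

Each cut discards the part of the current interval on the wrong side of the separating point, so the interval always keeps containing the target set.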


Algorithm 2 The Cutting Plane approach instantiated to the
problem of finding a point from the version space of $D$.

Ensure: $w$ solution of Problem (7)
 1: $C^0 \leftarrow B$
 2: $t \leftarrow 0$
 3: repeat
 4:   $w^t \leftarrow$ Query($C^t$)   ▷ compute query point $w^t$ in $C^t$
 5:   if $w^t \notin \mathcal{W}$ then
 6:     $n_t \leftarrow$ Pick($C^t$, $w^t$)   ▷ pick a cutting plane index
 7:     $C^{t+1} \leftarrow C^t \cap \{z : y_{n_t} \langle z, x_{n_t} \rangle \geq 0\}$
 8:     $t \leftarrow t + 1$
 9:   end if
10: until $w^t \in \mathcal{W}$
11: return $w^t$


B. Cutting Planes to Localize a Point in the Version Space 

Note that the problem of finding a point in $\mathcal{W}_0$ is
scale-insensitive: if $w \in \mathcal{W}_0$, then $\lambda w \in \mathcal{W}_0$ as well for any
$\lambda > 0$. In order to get rid of this degree of freedom and to
make the use of cutting-plane algorithms possible (they require
the sets $C^t$ to be bounded), we restrict ourselves to finding a
solution vector $w^*$ both in $\mathcal{W}_0$ and in the unit ball

$B = \{ w \in \mathcal{X} : \|w\| \leq 1 \}$. (5)

In other words, we will be looking for $w^*$ in the constrained
version space

$\mathcal{W} = \mathcal{W}_0 \cap B$ (6)

and the problem we face is therefore:

find $w$ such that $\|w\| \leq 1$ and $y_n \langle w, x_n \rangle > 0, \; n \in [N]$. (7)

In the case of Problem (7), the localization algorithm described
earlier translates into the one given in Algorithm 2. The
following changes can be observed when comparing with
Algorithm 1: $C^0$ is now initialized to $B$, the unit ball, and
the cutting planes are picked among the hyperplanes (i.e. the
points of $D$) defining the version space.

C. Query Point Generation 

In both Algorithm 1 and Algorithm 2, the strategy to compute
a query point is left unspecified. There actually exist many
ways to compute such query points, but they all aim at a query
point that calls for a cutting plane dividing the current
enclosing convex set $C^t$ as sharply as possible. It turns out
that such a guarantee can be expected when the query point
is as close as possible to the 'center' of $C^t$, so that the volume
of $C^t$ is reduced by a constant factor at each step, just as in
the well-known bisection method, where the factor is 1/2. The
center of $C^t$ is not defined in a unique way, but for the most
popular query methods, it may refer to: a) the center of gravity
of $C^t$, b) the center of the largest ball inscribed in $C^t$, which
is called the Chebyshev center, or c) the analytic center, which
we will not discuss further (the interested reader may refer to
| [T^ for further details). We may mention three things regarding
the center of gravity: i) it is NP-hard^1 to exactly compute the
center of gravity of a convex set in an arbitrary $n$-dimensional
space, even though some practical approximation algorithms exist;
ii) it is the query point that comes with the best guarantees in
terms of convergence speed of the cutting-plane method [20];
iii) the center of gravity of a polytope is precisely the point
that is looked for in the case of the theoretically founded Bayes
Point Machines of ©.


III. Results 

This section is devoted to some algorithmic results that can 
be obtained when analyzing the behavior of cutting-plane 
methods for the localization of a point in the version space. 


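Algorithm 3 Top: a Perceptron-based localization algorithm
for the case of Problem (7). Bottom: the slightly modified
Perceptron algorithm for compression schemes.

Ensure: $w$ solution of Problem (7)
  $t \leftarrow 1$, $w^0 \leftarrow 0$, $C^0 \leftarrow B$
  repeat
    $w^t \leftarrow$ Perceptron($w^{t-1}$, $x_{n_0}, \ldots, x_{n_{t-1}}$)
    $w^t \leftarrow w^t / \|w^t\|_2$
    if $w^t \notin \mathcal{W}$ then
      Pick a cutting plane index $n_t$
      $C^{t+1} \leftarrow C^t \cap \{z : y_{n_t} \langle z, x_{n_t} \rangle \geq 0\}$
      $t \leftarrow t + 1$
    end if
  until $w^t \in \mathcal{W}$
  return $w^t$

  function Perceptron($w^{\mathrm{start}}$, $x_{n_0}, \ldots, x_{n_k}$)
    $t \leftarrow 0$, $w^0 \leftarrow w^{\mathrm{start}}$
    while $\exists n_i : \langle w^t, x_{n_i} \rangle \leq 0$ do
      $w^{t+1} \leftarrow w^t + x_{n_i}$
      $t \leftarrow t + 1$
    end while
    return $w^t$
  end function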


A. Cutting Planes Provide Sample Compression Schemes 

Let $\mathcal{D} = \bigcup_{n=1}^{\infty} (\mathcal{X} \times \mathcal{Y})^n$ be the set of all finite training samples
made of pairs from $\mathcal{X} \times \mathcal{Y}$. In short, sample compression
schemes 0 are learning algorithms $\mathcal{A} : \mathcal{D} \to \mathcal{Y}^{\mathcal{X}}$ that are
associated with a compression function $\mathcal{S} : \mathcal{D} \to \mathcal{D}$ so that,
given any training sample $D$, we have $\mathcal{A}(D) = \mathcal{A}(\mathcal{S}(D))$.
Sample compression schemes are especially interesting when
the size $|\mathcal{S}(D)|$ of the compression set $\mathcal{S}(D)$ is small. Indeed,
the generalization guarantees that come with these procedures say
that the generalization error of $f_D = \mathcal{A}(D)$ is, with high
probability (over the random draw of the training set $D$ according
to an unknown and fixed distribution), bounded from above by
something like

N - |5(D)| _ |5(,D)i) 

(see ED for a precise statement of the bound). Among 
the most well-known learning compression schemes, we find 
the Perceptron and the Support Vector Machines. 
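To give a feel for how such guarantees scale with the compression size, the sketch below evaluates one classical form of sample compression bound (in the style of Littlestone, Warmuth and Floyd) for a predictor that is consistent with the training sample. This particular form, the union bound over all possible compression sizes, and the function name `compression_bound` are assumptions made for illustration; the precise statement the text refers to may differ.

```python
import math

def compression_bound(n_train, k, delta=0.05):
    """Illustrative bound on the generalization error of a consistent
    compression scheme with an (unordered) compression set of size k:

        (ln C(N, k) + ln((N + 1) / delta)) / (N - k)

    The ln(N + 1) term pays for a union bound over the possible sizes k,
    since k is not fixed before seeing the data.
    """
    if not 0 <= k < n_train:
        raise ValueError("need 0 <= k < n_train")
    log_subsets = math.log(math.comb(n_train, k))
    return (log_subsets + math.log((n_train + 1) / delta)) / (n_train - k)

# The bound degrades as the compression set grows.
for k in (4, 6, 50):
    print(k, round(compression_bound(1000, k), 4))
```

With $N$ fixed, the value increases with $k$, which is the sense in which a small compression set is beneficial.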

We claim that Algorithm 2, which finds a point in the version
space using cutting planes, may be a compression scheme.

Proposition 1. If Query($C^t$) (line 4 of Algorithm 2) and
Pick($C^t$, $w^t$) (line 6) are both deterministic, then Algorithm 2
is a sample compression scheme.

Proof: If the compression set is made of the training
examples that define the cutting planes, this result is a direct
consequence of the structure of Algorithm 2. A proof by
induction that essentially hinges on the fact that, at each
iteration $t$, the next query point is deterministically computed
from $C^t$ (only) gives the result. ■

^1 To be precise, it is actually #P-hard.

A few observations can be made. First, the learning algorithm
obtained under the assumptions of Proposition 1 is a process
sample compression scheme, that is, even if we interrupt
the learning before convergence has occurred, running the
algorithm on the partial compression set obtained so far
gives exactly the same predictor. Second, it is obviously
desirable to have fast convergence of the localization procedure,
where fast convergence means few iterations of the cutting-plane
procedure. This directly translates into the idea of finding
a point in the version space that is expressed as a combination
of as few vectors as possible, which, by the bound above, is very
beneficial for generalization purposes. Later, we will see that
there are settings for cutting-plane methods that come with
guarantees on the number of iterations, and therefore on
$|\mathcal{S}(D)|$, to reach convergence.

B. Perceptron-based Localization Algorithm 

One of the simplest ways to compute a query point for
Algorithm 2 is to run Rosenblatt's Perceptron algorithm Q at
each step and query the normalized solution $w^t = w / \|w\|_2$.
Intuitively, we may expect $w^{t+1}$ to be 'close' to $w^t$ because
$C^{t+1}$ is essentially the intersection of $C^t$ with a cutting plane,
and much of the geometry of $C^t$ might be preserved. According
to this intuition, $w^t$ should be a good starting point for the
Perceptron algorithm to be run and to have it output $w^{t+1}$.
Algorithm 3 implements that idea, and reuses the last query
point as an initialization vector for the Perceptron to compute
the next query point. Additionally, note that for Algorithm 3
to match Proposition 1, a little technicality is needed: we
require that datapoints are selected in the lexicographical
order^2 when multiple choices are possible (e.g. when picking a
cutting plane index and when picking a violated point inside
Perceptron()). It turns out that this simple querying procedure
enjoys the same convergence rate as a regular Perceptron, with the
added, empirically observed benefit of providing stronger
compression (see Section IV for empirical results).

Fig. 1: An example of version space where the Chebyshev center
(light blue) is a bad approximation of the center of gravity (dark blue).
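The procedure described in this subsection can be sketched in Python as follows. This is a minimal reading of the idea, not the authors' implementation: labels are folded into the points ($z_n = y_n x_n$), ties are broken by lowest index as a stand-in for the fixed total order, the data are assumed linearly separable through the origin (otherwise the loops do not terminate), and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(w_start, points):
    """Perceptron updates from w_start until every point has positive margin."""
    w = list(w_start)
    while True:
        # First violated point in list order (a fixed, deterministic choice).
        bad = next((z for z in points if dot(w, z) <= 0), None)
        if bad is None:
            return w
        w = [wi + zi for wi, zi in zip(w, bad)]

def normalize(w):
    norm = math.sqrt(dot(w, w))
    return [wi / norm for wi in w] if norm > 0 else list(w)

def perceptron_localization(sample):
    """Return (w, picked): a unit vector in the version space and the indices
    of the examples used as cutting planes (the candidate compression set)."""
    # Fold labels into the points so that 'correct' means <w, z> > 0.
    z = [[y * xi for xi in x] for x, y in sample]
    w = [0.0] * len(z[0])
    picked = []
    while True:
        # Warm start: reuse the previous query point as initialization.
        w = normalize(perceptron(w, [z[n] for n in picked]))
        errors = [n for n in range(len(z)) if dot(w, z[n]) <= 0]
        if not errors:
            return w, picked
        picked.append(errors[0])  # lowest index: a fixed total order

D = [((-1.0, 2.0), 1), ((2.0, 1.0), 1), ((-1.0, -3.0), -1),
     ((-2.0, 0.5), -1), ((3.0, 1.0), 1)]
w, picked = perceptron_localization(D)
print(w, picked)
```

Only the examples listed in `picked` are ever shown to the inner Perceptron, so rerunning the procedure on those examples alone reproduces the same query points, which is the compression property discussed above.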


Proposition 2. Consider Problem (7) and let $\gamma$ be the radius
of the largest inscribed sphere in $\mathcal{W}$. Define $M$ as the number
of Perceptron updates performed by the Perceptron-based
Localization Algorithm (i.e. $M$ is the number of times
the update line of Perceptron() in Algorithm 3 is executed). Then
the following holds: $M \leq 1/\gamma^2$.


Proof: We recall that the usual definition of the margin
of $D$ is $\min_{x \in D} \langle w^*, x \rangle$ and note that $\gamma$ is related to it since
$\forall n \in [N]$, $\langle w^*, x_n \rangle / \|x_n\|_2 \geq \gamma$. Let $S = \{a_1, \ldots, a_M\}$ be
the sequence of points used to perform Perceptron updates
across a complete execution of Algorithm 3. Thus, $S$ is a
sequence from $D$ (with possible duplicates) and $w^*$ achieves a
margin of at least $\gamma$ with all points in $S$. From @,0 we know
that the number $M$ of Perceptron updates on any arbitrary
sequence linearly separable with margin $\gamma$ is no more than
$1/\gamma^2$. Since we use $w^t$ as a starting point to compute $w^{t+1}$, the
execution of the cutting-plane algorithm is tied to the execution
of the Perceptron algorithm on $S$. Therefore, there are no more than
$1/\gamma^2$ Perceptron updates during the execution of the algorithm.
Alternatively, $|S| \leq 1/\gamma^2$ since all points in $S$ correspond to
a Perceptron update, thus a mistake. ■
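For reference, the classical mistake-bound argument invoked here can be sketched as follows for a plain Perceptron started at $w_0 = 0$, under the additional assumptions (made explicit here, not taken from the text) that $\|w^*\|_2 = 1$ and $\|a_i\|_2 \leq 1$ for every update point:

```latex
% Each update uses a point a_i with <w_{i-1}, a_i> <= 0 and <w^*, a_i> >= gamma.
\begin{align*}
\langle w^*, w_M \rangle
  &= \sum_{i=1}^{M} \langle w^*, a_i \rangle \;\geq\; M\gamma, \\
\|w_M\|_2^2
  &= \|w_{M-1}\|_2^2 + 2\langle w_{M-1}, a_M \rangle + \|a_M\|_2^2
   \;\leq\; \|w_{M-1}\|_2^2 + 1 \;\leq\; M, \\
M\gamma
  &\leq \langle w^*, w_M \rangle
   \;\leq\; \|w^*\|_2 \, \|w_M\|_2 \;\leq\; \sqrt{M}
   \quad\Longrightarrow\quad M \leq 1/\gamma^2 .
\end{align*}
```

The first line accumulates the margin of each update point, the second uses the fact that updates only happen on misclassified points, and the third combines both through the Cauchy-Schwarz inequality.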

On a side note, the same argument can be applied to obtain 
similar results with most Perceptron-like learning procedures 
(see for instance | |^ , d^). 


C. Center of Gravity and Approximations 

The question of computing a query point $w^t$ is of central
importance in cutting-plane localization algorithms. As we
have seen, a simple Perceptron can already yield interesting
computational results in that respect. A more careful analysis
of this question can be conducted by looking at the volume
reduction $V(C^{t+1})/V(C^t)$ from one iteration to the next.
The notion of center of gravity is going to be pivotal to this
end.

^2 This is an arbitrary choice and any total order over $\mathbb{R}^d$ can be used instead.

Definition 1 (Center of Gravity). Let $C$ be a closed set in
$\mathbb{R}^d$. The center of gravity (CG) $\mathrm{cg}(C)$ of $C$ is defined as
$\mathrm{cg}(C) = \int_C z \, dz / \int_C dz$.

The center of gravity is deeply tied to the volume of $C^t$ and
plays a central role in devising cutting-plane algorithms for
which the volume reduction $V(C^{t+1})/V(C^t)$ is the largest.
Theorem 1 reports one of the most fundamental properties of
the center of gravity.

Theorem 1 (Partition of Convex Bodies). Let $C \subset \mathbb{R}^d$ be a convex
body with center of gravity $\mathrm{cg}(C)$ and $h$ a hyperplane such that
$\mathrm{cg}(C) \in h$. Then $h$ divides $C$ into two subsets $C_1$ and $C_2$, and the
following relations hold for $i = 1, 2$: $V(C_i) \geq e^{-1} V(C)$.

The center of gravity method proposed by ||2^ consists
in querying $w^t = \mathrm{cg}(C^t)$ and typically has a very fast
convergence rate, as the version space is almost halved at each
step. More precisely, a direct consequence of Theorem 1 is that
the volume of $C^t$ is bounded by $V(C^t) \leq (1 - 1/e)^t V(C^0)$.
However, computing the center of gravity is hard, making
the center of gravity method impractical. Instead, one has to
consider structural or numerical approximations to the center
of gravity.
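As a small worked example of this volume bound, the sketch below computes how many center-of-gravity cuts suffice to guarantee that the volume has shrunk by a given factor, using only the per-step rate $1 - 1/e$ stated above. The function name `iterations_for_reduction` is an illustrative assumption.

```python
import math

RATE = 1.0 - 1.0 / math.e  # guaranteed bound on V(C^{t+1}) / V(C^t), about 0.632

def iterations_for_reduction(eps):
    """Smallest t such that (1 - 1/e)^t <= eps, for 0 < eps <= 1."""
    if not 0.0 < eps <= 1.0:
        raise ValueError("eps must be in (0, 1]")
    return math.ceil(math.log(eps) / math.log(RATE))

# Number of cuts that guarantees the volume drops below 1% of V(C^0).
print(iterations_for_reduction(0.01))  # -> 11
```

Since $1/\ln(1/(1 - 1/e)) \approx 2.18$, the guaranteed number of iterations grows like $2.18 \ln(1/\varepsilon)$, i.e. only logarithmically in the target reduction.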

Definition 2 (Chebyshev's Center). Let $C$ be a set in $\mathbb{R}^d$.
Chebyshev's center (CC) of $C$, $\mathrm{cc}(C)$, is the center of the largest
inscribed ball in $C$:

$\mathrm{cc}(C) = \arg\min_{z} \max_{z' \in C} \|z - z'\|_2^2$.

Chebyshev's center has been used as a computationally efficient
approximation of the center of gravity for cutting-plane
algorithms since the late 70's p8) (see, e.g. [29] for a linear
formulation of the problem). Unfortunately, the interesting
property of Theorem 1 does not carry over to Chebyshev's
center. One problem in machine learning related to
Chebyshev's center is the extensively studied Support Vector
Machine (SVM) [30], defined as:

$\min_{w} \frac{1}{2} \|w\|_2^2$ s.t. $y_n \langle w, x_n \rangle \geq 1, \; n \in [N]$. (9)


A notable property of the SVM is that its solution $w_{\mathrm{SVM}}$ is
closely related to the center of the largest inscribed ball in $\mathcal{W}$
and is an approximation of the center of gravity R). Indeed,
$w_{\mathrm{SVM}}$ is actually a rescaled Chebyshev's center fTTra.

On the other hand, numerical approximations aim at finding
a point that is in the close neighborhood of the center of
gravity. One of the contributions of this paper is to give a
generalized version of Theorem 1 for approximations of the
center of gravity, thus providing a theoretical justification for
these methods.

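Theorem 2 (Generalized Partition of Convex Bodies). Let $C$
be a closed convex body in $\mathbb{R}^d$ and $\mathrm{cg}(C)$ its center of gravity.
Let $h_x$ be a hyperplane of normal vector $x$, $\|x\|_2 = 1$, and define
the upper (resp. lower) partition $C^+$ (resp. $C^-$) of $C$ by $h_x$ as

$C^+ = C \cap \{ w \in \mathbb{R}^d : \langle x, w \rangle \geq 0 \}$
$C^- = C \cap \{ w \in \mathbb{R}^d : \langle x, w \rangle < 0 \}$.

The following holds true: if $\mathrm{cg}(C) + \Lambda x \in C^+$ then

$V(C^+)/V(C) \geq e^{-1} (1 - \lambda)^d$

where

$\lambda = \Lambda \Theta_d \frac{V(C) H_{C^+}}{R^d H_{C^-}}$

with $\Lambda \in \mathbb{R}$ an arbitrary real, $\Theta_d$ a constant depending
only on $d$, $R$ the radius of the $(d-1)$-dimensional ball
$B$ of volume $V[B] = V[C \cap \{ w \in \mathbb{R}^d : \langle x, w \rangle = 0 \}]$ and
$H_{C^+} = \max_{a \in C^+} a^\top x$ (resp. $H_{C^-} = \min_{a \in C^-} a^\top x$).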

Proof: The proof is a (non-trivial) extension of Grünbaum's
proof of Theorem 1 p4|. Due to space restrictions,
we cannot present it here in full and refer the interested
reader to http://pageperso.lifuniv-mrs.fr/~ugo.louche/paper/activeCPSuppl.pdf ■

Theorem 2 extends Theorem 1 to the situation where an
approximation of the center of gravity is considered; it reduces
to Theorem 1 when applied to the very center of gravity. This is,
to the best of our knowledge, the first result of this kind, and
it is of independent interest as it may benefit many fields of
computer science. Here, the purpose of Theorem 2 is essentially
to validate the use of approximations of the center of gravity
$\mathrm{cg}(C)$ in the procedures at hand, which is inevitable due to the
complexity of exactly finding this point. More precisely, we will
use it on two occasions: a) for center-of-gravity-based
compression scheme methods and b) in the active learning setting
(see below).


D. Active Learning with Cutting Planes 

An interesting learning situation is that of active learning,
where the algorithm is presented with unlabelled data and has
to query for the labels of the training points that carry
the most information to build a relevant decision boundary.
Given a volume $C$ inside which a good classifier $w^*$ for the
classification task at hand is known to lie, the amount of
information carried by a labeled training point $(x, y)$ (where
$y$ has been queried) might for instance be measured by how
$(x, y)$ can be used to identify within $C$ a (hopefully small)
volume $C' \subseteq C$ where $w^*$ lives. In other words, the amount
of information provided by $(x, y)$ might be measured as the
volume reduction induced by the knowledge of $(x, y)$: this is
exactly the type of information cutting-plane methods build
upon. We take advantage of this philosophy shared by active
learning methods and cutting-plane algorithms to argue that it is
easy to transform a cutting-plane algorithm into an active
learning method. Based on the idea of maximum volume
reduction, the question to address is simply that of identifying
a training pattern $x$ in $D$ that, independently of the label
it might receive, is guaranteed to define a cutting hyperplane
of equation $\langle x, w \rangle = 0$ that intersects the current convex set $C$ in
a controlled way. To do so, a typical good query point is one
whose hyperplane passes as close as possible to the 'center' of
$C$, where center may have the few meanings discussed above (cf.
center of gravity, Chebyshev's center). Algorithm 4 is


Algorithm 4 Top: a generic cutting-plane active learning
procedure; $w^t$ is computed as the 'center' of $C^t$ (center may
refer to the center of gravity or the Chebyshev center). Bottom:
a possible implementation of Query(); sampling strategies are
given in, e.g., g), gT), 132|.

  $C^0 \leftarrow B$
  $t \leftarrow 0$
  repeat
    $w^t \leftarrow$ center($C^t$)
    $x_{n_t}, y_{n_t} \leftarrow$ Query($C^t$, $D$)
    if $y_{n_t} \langle w^t, x_{n_t} \rangle < 0$ then
      $C^{t+1} \leftarrow C^t \cap \{z : y_{n_t} \langle z, x_{n_t} \rangle \geq 0\}$
      $t \leftarrow t + 1$
    end if
  until $C^t$ is small enough
  return $w^t$

  function Query($C$, $D$)
    Sample $M$ points $s_1, \ldots, s_M$ from $C$
    $g \leftarrow \sum_{k=1}^{M} s_k / M$
    $x \leftarrow \arg\min_{x_i \in D} |\langle g, x_i \rangle|$
    $y \leftarrow$ get label from an expert
    return $x, y$
  end function


a generic active learning algorithm that is based on the classical 
cutting-plane approach. 
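A possible reading of the querying step can be sketched as follows. The sketch makes several assumptions that are not in the paper: points of the current convex set are drawn by plain rejection sampling from the unit ball (a simple stand-in for samplers such as Hit-and-Run), the selected pool point is the one with the smallest $|\langle g, x \rangle|$, where $g$ is the sample mean, and the names `sample_convex_set`, `query` and `expert` are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_convex_set(cuts, dim, n_samples, rng, max_tries=1_000_000):
    """Rejection sampling from {s : ||s|| <= 1 and <c, s> >= 0 for every cut c}."""
    samples = []
    for _ in range(max_tries):
        s = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if dot(s, s) <= 1.0 and all(dot(c, s) >= 0.0 for c in cuts):
            samples.append(s)
            if len(samples) == n_samples:
                return samples
    raise RuntimeError("acceptance rate too low for rejection sampling")

def query(cuts, pool, get_label, rng, n_samples=2000):
    """Approximate the center of gravity g of the current convex set, then
    ask for the label of the pool point whose hyperplane is closest to g."""
    dim = len(pool[0])
    samples = sample_convex_set(cuts, dim, n_samples, rng)
    g = [sum(s[i] for s in samples) / n_samples for i in range(dim)]
    x = min(pool, key=lambda p: abs(dot(g, p)))
    return x, get_label(x), g

# Toy setting: a hidden classifier plays the role of the expert.
w_true = (0.6, 0.8)
def expert(x):
    return 1 if dot(w_true, x) > 0 else -1

cuts = [(1.0, 0.0)]  # one earlier cut: the current set is a half disc
pool = [(1.0, 0.2), (0.1, 1.0), (-0.5, 0.5), (0.9, -1.0)]
x, y, g = query(cuts, pool, expert, random.Random(0))
print(x, y)
```

Rejection sampling is only practical in low dimension, since the acceptance rate collapses as the dimension or the number of cuts grows; it is used here purely to keep the example self-contained.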

Making active learning algorithms from cutting-plane methods
is a route that has been taken by [1], even though the
connection with cutting-plane algorithms was not clearly identified.

Being able to approximate the center of gravity of a convex
polytope is pivotal for the design of active learning strategies.
It is interesting to note that, in recent years, methods have
been devised to uniformly sample from the version space, such
as the Hit-and-Run algorithm of pT| or the billiard algorithm
of p^. More recently, the Dikin Walk algorithm
provided a strongly polynomial algorithm for approximate
uniform sampling over the version space, while the Expectation
Propagation method of p^ gave a Bayesian interpretation
of billiard algorithms. Notably, these methods have been
successfully used with cutting planes for active Boosted Learning
p5). Another practical approach we should mention is the one
proposed in Q, which consists in repeatedly running a Perceptron
over permutations of the training set; in the active learning
setting, however, the number of labeled points available is just
too low to produce an interesting approximation of the center of
gravity with this method.

A by-product of our active learning procedure is that we now
solve a Bayes Point Machine (BPM) problem Q at each step $t$
by finding the center of gravity of the current convex body $C^t$.
Therefore, we can turn our active learning procedure into a full
active learning algorithm, which we dub Active-BPM, for
free by using the center of gravity for classification. Note that
this is one of many possible instantiations of our procedure,
which is nonetheless of interest as it is the BPM counterpart
of the Active-SVM algorithm of Tong and Koller [1].

In conclusion, Theorem 2 provides a general guideline to
systematically query the training point that comes with the best
volume reduction guarantees. This is a theoretically sound and
viable strategy for active learning that comes with a theoretical
bound on the induced volume reduction, the lack of which was
an essential limitation of the Chebyshev's center-based method of
0.

IV. Numerical Simulations 

Here, we present some empirical simulations based on the 
algorithms described throughout this paper in both passive and 
active learning settings. 

A. Synthetic Data and Perceptron-based Localization Algorithm

We generate a toy dataset of 1,000 2-dimensional datapoints.
Each point is uniformly drawn from a 20-by-20 square centered
at the origin. We label this dataset according to a classifier
$w^*$ uniformly drawn over the unit circle. In order to have
only positive labels, negative examples are reflected through
the origin. We then enforce a minimal margin $\gamma$ by pruning
examples $x_i$ for which $\langle w^*, x_i \rangle < \gamma$. This last modification
gives us some control over the size of the version
space $\mathcal{W}$. The downside is that we no longer have
exactly 1,000 datapoints (though during our experiments we
noted that the size of the dataset stays mostly the same for
reasonable margin values).
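The generation procedure just described can be sketched as follows. The function name `make_dataset`, its parameters and the use of a seeded generator are illustrative assumptions; only the sampling, reflection and pruning steps follow the text.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_dataset(n_points=1000, gamma=0.1, seed=0):
    """Toy 2-D dataset with positive labels only and margin at least gamma."""
    rng = random.Random(seed)
    # Classifier drawn uniformly on the unit circle.
    theta = rng.uniform(0.0, 2.0 * math.pi)
    w_star = (math.cos(theta), math.sin(theta))
    data = []
    for _ in range(n_points):
        # Uniform draw in the 20-by-20 square centered at the origin.
        x = (rng.uniform(-10.0, 10.0), rng.uniform(-10.0, 10.0))
        if dot(w_star, x) < 0:
            x = (-x[0], -x[1])        # reflect negative examples through the origin
        if dot(w_star, x) >= gamma:   # prune examples with margin below gamma
            data.append(x)
    return w_star, data

w_star, data = make_dataset()
print(len(data))
```

With $\gamma = 0.1$ the pruned band is thin compared to the square, so only a small fraction of the 1,000 points is expected to be discarded.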

For these experiments, we use the Perceptron-based
Localization algorithm (Algorithm 3). We implement it with three
different oracle strategies for selecting cutting planes. The first
strategy (which we call Largest Error) picks the cutting plane
with the lowest margin. The second one (Smallest Error) picks
the cutting plane with the highest negative margin, that is to say
a point that is incorrectly classified but close to the decision
boundary. Finally, the third one (Random Error) simply picks a
cutting plane with negative margin at random. It should also be
noted that our instantiation of the Perceptron algorithm picks
the update vector that realizes the lowest margin for its internal
update (the update line of Perceptron() in Algorithm 3). This is
mostly an arbitrary choice and we only mention it for the sake
of reproducibility.
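The three oracle strategies can be written down compactly. The sketch below assumes, as in the synthetic setting above, that all examples carry a positive label, so the margin of $x$ under $w$ is simply $\langle w, x \rangle$; the function name `pick_cutting_plane` and the strategy keywords are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pick_cutting_plane(w, data, strategy, rng=random):
    """Return the index of the example used as next cutting plane, or None
    if no example has a negative margin under w."""
    # (margin, index) pairs of the misclassified examples.
    errors = [(dot(w, x), n) for n, x in enumerate(data) if dot(w, x) < 0]
    if not errors:
        return None
    if strategy == "largest":   # Largest Error: lowest margin
        return min(errors)[1]
    if strategy == "smallest":  # Smallest Error: negative margin closest to zero
        return max(errors)[1]
    if strategy == "random":    # Random Error: any misclassified example
        return rng.choice(errors)[1]
    raise ValueError("unknown strategy: " + strategy)

w = (1.0, 0.0)
data = [(2.0, 1.0), (-3.0, 1.0), (-0.5, 2.0), (-1.0, -1.0)]  # margins 2, -3, -0.5, -1
print(pick_cutting_plane(w, data, "largest"))   # -> 1
print(pick_cutting_plane(w, data, "smallest"))  # -> 2
```

Ties are broken through tuple comparison on the index, which is one arbitrary but deterministic choice among many.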

The first experiment consists in a single run over a dataset of
margin $\gamma = 0.1$. We monitor both the number of cutting planes
generated and the number of internal Perceptron updates for
each cutting plane. The presented results are averaged over
1,000 runs.

The left pane of Figure 2 supports the soundness of our
approach as a compression scheme, with no more
than 6 cutting planes for the best strategy (Largest Error).
Additionally, we can observe a sharp decrease after the third
cutting plane with this strategy: 80% of the time, only 4
cutting planes are required to model the dataset. In contrast,
the right-hand side of Figure 2 reveals a trade-off between
the number of cutting planes used and the number of internal
updates for each cutting plane. We observe a smooth shift
across our three strategies, with Smallest Error putting the
emphasis on a small number of internal updates. In all respects,
the Random Error strategy acts as a middle ground between
the two other, extreme approaches.

For the second experiment, the margin (i.e. the volume of
$\mathcal{W}$) is variable, with values between 0.01 and 0.3. We also
monitor the total number of internal updates, rather than the
per-cutting-plane value, for the three strategies and a regular
Perceptron algorithm^3. Recall that this value is bounded
by Proposition 2. This bound also holds for the regular
Perceptron.

The previously observed behavioral shift across the three
strategies is confirmed by Figure 3. Additionally, some relative
robustness is observed with respect to $\gamma$, especially when the
emphasis is put on querying a small number of cutting planes.
It is interesting to note that the Random Error strategy makes
nearly as few updates as Smallest Error while still querying a
relatively low number of cutting planes. Finally, all three
strategies make slightly fewer updates than the regular
Perceptron. To conclude, note that the theoretical bound of
Proposition 2 is far too large to be plotted on the left plot
of Figure 3.

B. Active Learning on Real Data 

We illustrate our method for active learning on text
classification data. For easy comparison, we follow an experimental
procedure similar to the one in 0. Namely, we use the
Reuters-21578 (ModApte variation) and Newsgroups datasets^4. The
Reuters dataset is composed of 8,293 documents represented
in TF-IDF form over 18,933 words. The dataset spans 65
topics such as Earn, Coffee or Cocoa and is split into 5,946
training examples and 2,347 test examples. The Newsgroups
dataset, on the other hand, accounts for 18,846 documents with
26,214 features, split into 20 topics. Half of this dataset
is uniformly picked for training while the rest is kept for
testing purposes. On both datasets we train a "one-versus-all"
classifier for each class. We start by creating a pool of
unlabeled training examples sampled from the training set.
Then we run Algorithm 4. We use two variations of the
Query() function: one based on the Chebyshev center (note
that this is equivalent to the Active-SVM of 0), and the
other based on an approximation of the center of gravity
obtained with Minka's Expectation Propagation method p4). This
last approach corresponds to the Active-BPM algorithm and
has, to the best of our knowledge, never been used before.
It is a direct application of the cutting-plane active learning
procedure to the Bayes Point Machine. For both
methods, we use two pools of different sizes (500 and 1,000
examples). For initialization reasons, each pool comes with
two already labeled vectors^5. All the computations are done
with a linear kernel and the presented results are class-wise
accuracy measurements on the test examples over the 10 most
represented classes. The values reported here are an average
of these measures over 25 runs. We complement these two
datasets with Gunnar Raetsch's Banana dataset. The Banana
dataset is a widely used dataset of 2-dimensional points split
into two classes, from which we extract 400 training and 4,900
test examples. Due to its small size, the whole training set
is used for the pool of unlabeled examples. The computations


^More precisely, we use the exact same Perceptron as the one used for
the internal loop, but run on the full dataset.

^Available at http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html
^SVM and CC are computed with libSVM:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/. BPM and CG are computed
with Minka’s own implementation of EP for BPM in Matlab:
http://research.microsoft.com/en-us/um/people/minka/papers/ep/bpm/






Fig. 2: Left: for each value i, the bar represents the empirical probability (over 1,000 runs) of querying at least i cutting planes. Right: each bar
represents the number of internal Perceptron updates computed after each Cutting Plane loop.

Fig. 3: Left: the average number of cutting planes used by each strategy with respect to the value of γ. Right: the total number of internal
updates with respect to γ. The fourth plot corresponds to a regular Perceptron.


are performed with an RBF kernel of parameter σ = 0.5, and the
presented results are averaged over 50 runs.

Figure 4 graphically depicts the behavior of the so-called
Active-SVM of Tong and Koller and of the Active-BPM algorithm
on each dataset. In both algorithms, the queries are selected
according to their distance to the “centroid” of C, which, in
turn, serves as classifier. The difference between the two
algorithms is that Active-SVM uses the Chebyshev
center as centroid while Active-BPM uses the center of gravity.
In Figure 4, data points are represented by circles or squares
depending on whether they correspond to results achieved by
Active-SVM or Active-BPM. Additionally, for the Reuters and
Newsgroups datasets, dashed plots correspond to the pool of 500
examples while dotted plots relate to the pool of 1,000 examples.
The error bars on the third plot (Banana) correspond to the
usual standard deviation. Each plot represents the accuracy of
the algorithms with respect to the number of queries made.
We can see that Active-BPM systematically outperforms
Active-SVM and increases its accuracy faster on all datasets,
already attaining an accuracy of 0.9 after roughly 10 queries
on both the Reuters and Newsgroups datasets. Both algorithms
seem to stabilize after 30 queries, with Active-BPM
being slightly more accurate than its SVM counterpart. For
the Banana dataset, the accuracy increase over the first queries
is a lot smoother, with an accuracy for Active-BPM of roughly
0.8 after 20 queries. Both algorithms seem to have converged
after 60 queries. Comparatively, not only does Active-BPM
clearly dominate its SVM counterpart but it is also more stable,
as evidenced by the error bars, which become negligible past
the 60th query.
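The pool-based procedure described above can be sketched as follows. This is only an illustrative sketch under loud assumptions: the centroid computation (Chebyshev center or EP-based center of gravity) is replaced by a plain Perceptron as a stand-in, the data are synthetic, and the names `fit_perceptron`, `query`, `active_learning` and `oracle` are hypothetical and do not come from the paper or from any library.

```python
import numpy as np

def fit_perceptron(X, y, epochs=50):
    """Stand-in for the centroid computation: a plain Perceptron on the labeled points."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                w += y_i * x_i
                mistakes += 1
        if mistakes == 0:
            break
    return w

def query(w, X_pool, unlabeled):
    """Return the index of the unlabeled point closest to the hyperplane of normal w."""
    margins = np.abs(X_pool[unlabeled] @ w)
    return int(unlabeled[int(np.argmin(margins))])

def active_learning(X_pool, oracle, init, n_queries):
    """Pool-based loop: fit, query the closest point, ask the oracle, repeat."""
    labeled = [int(i) for i in init]
    labels = {i: oracle(i) for i in labeled}
    for _ in range(n_queries):
        w = fit_perceptron(X_pool[labeled], np.array([labels[i] for i in labeled]))
        unlabeled = np.array([i for i in range(len(X_pool)) if i not in labels])
        if unlabeled.size == 0:
            break
        i = query(w, X_pool, unlabeled)
        labels[i] = oracle(i)
        labeled.append(i)
    w = fit_perceptron(X_pool[labeled], np.array([labels[i] for i in labeled]))
    return w, labeled

# Synthetic, linearly separable pool (assumption: not one of the paper's datasets).
rng = np.random.default_rng(0)
X_pool = rng.standard_normal((200, 5))
y_true = np.sign(X_pool @ rng.standard_normal(5))
# As in the text, the pool comes with two already labeled vectors, one per class.
init = [int(np.flatnonzero(y_true > 0)[0]), int(np.flatnonzero(y_true < 0)[0])]
w, labeled = active_learning(X_pool, lambda i: y_true[i], init, n_queries=20)
```

Swapping `fit_perceptron` for a Chebyshev-center or center-of-gravity routine would give the two variants compared in the text; the query rule itself would stay unchanged.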

V. Conclusion and Future Directions 

In this paper, we have shown that deep connections exist
between localization methods and learning algorithms. Both
fields have extensively characterized and studied similar
concepts over the past years, sometimes independently. On the
other hand, complementary results have been found in each
community. A notable example is the absence of a kernel
approach in the cutting-plane literature, while center of
gravity methods were mostly unknown in machine learning
until Herbrich’s BPM. We may also mention that the
cutting-plane equivalent of the famous SVM appeared
as early as the 1970s [28]. This work is a testimony to
how it is possible to derive new learning algorithms, both
efficient and theoretically founded, by reformulating the
cutting-plane approach for the learning paradigm. Besides the
cutting plane-related flavor of the present work, it should be
restated that our main theorem has a value that goes beyond the
scope of this paper. A field that may be impacted by this result is
obviously that of computational geometry where most of the

Fig. 4: Accuracy on the Reuters (left) and Newsgroups (middle) datasets for Active-SVM and Active-BPM for pools of 500 and 1,000
examples. Right: accuracy with error bars on the Banana dataset (Gunnar Raetsch) for Active-SVM and Active-BPM.


results about the computation of centers of gravity come from; 
nonetheless, it should be noted that more closely related works 
could also benefit from our result. For instance, if we consider 
the active learning methods whose query steps rely on explicit 
exploration of all the possible query/label combinations (see, 
e.g. 1^), then Theorem provides a tool to devise natural 
and theoretically sound heuristics to effectively locate the most 
informative query points, or, in other words, those that may 
lead to the smallest expected error. 


Among all the possible extensions of this work, one we
are particularly interested in is to study how these results
may carry over to the multiclass setting and provide proper
multiclass active algorithms based on, for example, Crammer’s
Ultraconservative Additive Algorithms [37].

Appendix


This appendix is composed of three parts. Section A serves
as a reminder of the basic notions and results used in the proofs
of the other parts; we also introduce our notation throughout
this part. The second part consists in a rewriting
of the proof of Grünbaum on the partition of convex
bodies by hyperplanes. The proof is restated in full with proper
notation as it is the starting point of our result. The last part
gives the proof of Theorem 4, which is an extended version of
the result of Grünbaum and is one of the contributions of our
paper.

A. Hyper-Sphere and Hyper-Ball

Definition 3 (n-dimensional Sphere). We call n-sphere of
center $O \in \mathbb{R}^n$ and radius $R \in \mathbb{R}$, and write $S(O,R) \subset \mathbb{R}^n$,
the subset

\[ S(O,R) = \{x \in \mathbb{R}^n : \|x - O\|_2 = R\} \]

Definition 4 (n-dimensional Ball). We call n-ball of center
$O \in \mathbb{R}^n$ and radius $R \in \mathbb{R}$, and write $B(O,R) \subset \mathbb{R}^n$, the
subset

\[ B(O,R) = \{x \in \mathbb{R}^n : \|x - O\|_2 \le R\} \]

Alternatively, one can think of a ball as:

\[ B(O,R) = \bigcup_{r \in [0,R]} S(O,r) \]

Definition 5 (Surface of a sphere). We call Surface of the
n-sphere $S(O,R)$, and write $V(S(O,R))$, the $(n-1)$-dimensional
volume

\[ V(S(O,R)) = \Pi_n^S \times R^{n-1} \]

where $\Pi_n^S$ is a constant factor depending only on n (e.g. $\Pi_1^S = 2$,
$\Pi_2^S = 2\pi$, $\Pi_3^S = 4\pi$ and so on).

Definition 6 (Volume of a Ball). We call Volume of the n-ball
$B(O,R)$, and write $V(B(O,R))$, the n-dimensional volume

\[ V(B(O,R)) = \int_0^R V(S(O,r))\,dr \]

That is

\[ V(B(O,R)) = \int_0^R \Pi_n^S r^{n-1}\,dr = \frac{\Pi_n^S}{n} R^n = \Pi_n R^n \]

where $\Pi_n = \Pi_n^S / n$ is a constant factor depending only on n.

B. Hyper-Cone

From these core definitions, we can now introduce (Hyper-)cones
and some of their core properties. Intuitively, a Hyper-cone
of dimension n + 1, center O, radius R and height H is
a sequence of n-balls of linearly decreasing radius between R
and 0, each one living on a different “slice” of $\mathbb{R}^{n+1}$ between
O and O + H.

Remark 1. We will use $v_{n+1}$ to denote the vector of $\mathbb{R}^{n+1}$
with 1 on its (n + 1)-th component and 0 elsewhere.

Definition 7 (Hyper-cone). We call Hyper-cone of dimension
n + 1, base $B(O,R) \subset \mathbb{R}^n$ and height H the set:

\[ C = \bigcup_{h \in [0,H]} B\left(O + h v_{n+1}, \frac{H-h}{H} R\right) \]

Alternatively, we can define the apex $Z = O + H \times v_{n+1}$ of
the hyper-cone and give the following definition:

Definition 8 (Hyper-cone (2)). We call Hyper-cone of dimension
n + 1, base $B(O,R) \subset \mathbb{R}^n$ and apex Z the convex hull
$\mathrm{conv}(\{B(O,R); Z\})$.

We are now ready to state the core properties of Hyper-cones
that we will use in the remainder of this document.

We start with the volume of a Hyper-cone.

Definition 9 (Volume of Hyper-cone). Given a Hyper-cone
$C \subset \mathbb{R}^{n+1}$ of dimension n + 1, base $B(O,R) \subset \mathbb{R}^n$ and
height H, we call volume, and write $V(C)$, the (n + 1)-dimensional
volume:

\[ V(C) = \int_0^H V\left(B\left(O + h v_{n+1}, \frac{H-h}{H} R\right)\right) dh \]

Proposition 3. The volume of the Hyper-cone $C \subset \mathbb{R}^{n+1}$ of
dimension n + 1, base $B(O,R) \subset \mathbb{R}^n$ and height H is

\[ V(C) = \frac{\Pi_n R^n}{n+1} H \]

Proof: From the definition of the volume of a ball we have
$V(B(O,R)) = \Pi_n R^n$. Substituting R by $\frac{H-h}{H} R$ and from the
definition of the volume of a Hyper-cone we have

\[ V(C) = \int_0^H \Pi_n \frac{R^n}{H^n} (H-h)^n\,dh \]

We substitute h by u = H − h, du = −dh:

\[ V(C) = \Pi_n \frac{R^n}{H^n} \int_0^H u^n\,du = \Pi_n \frac{R^n}{H^n} \frac{H^{n+1}}{n+1} = \frac{\Pi_n R^n}{n+1} H \]

C. Center of Mass

With the previous definitions formally stated, we can now go
one step further and define the center of mass, sometimes
called center of gravity.

Definition 10 (Center of gravity). For a given convex body
$K \subset \mathbb{R}^n$ we call center of gravity, and write $\mathrm{cg}(K) \in \mathbb{R}^n$,
the point:

\[ \mathrm{cg}(K) = \frac{1}{V(K)} \int_{x \in K} x\,dx \]

Proposition 4. Let S be an n-dimensional sphere such that
$S = S(O,R)$. Then $\mathrm{cg}(S) = O$.

Proof: Without loss of generality, assume that O = 0.
Then, $S = \{x \in \mathbb{R}^n : \|x\|_2 = R\}$. Since $\|x\|_2 = \|-x\|_2$ it is
clear that $\forall x \in \mathbb{R}^n : x \in S \Leftrightarrow -x \in S$. Thus, we can rewrite
$\mathrm{cg}(S)$ as

\[ \mathrm{cg}(S) = \frac{1}{V(S)} \int_{x \in S} -x\,dx \]

Thus

\[ 2\,\mathrm{cg}(S) = \frac{1}{V(S)} \left( \int_{x \in S} x\,dx + \int_{x \in S} -x\,dx \right) = \frac{1}{V(S)} \int_{x \in S} (x - x)\,dx = 0 \]

Proposition 5. Let B be an n-dimensional ball such that
$B = B(O,R)$. Then $\mathrm{cg}(B) = O$.

Proof: Recall that B can be seen as a collection of
concentric n-spheres of center O and radii between 0 and R
(see Def. 4). Then, we can rewrite $\mathrm{cg}(B)$ as

\[ \mathrm{cg}(B) = \frac{1}{V(B)} \int_0^R \mathrm{cg}(S(O,r))\,V(S(O,r))\,dr = O \]

where the last equality comes from Proposition 4. ■


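Proposition 6 (Center of Gravity of a Hyper-cone). Let C be an
(n + 1)-dimensional Hyper-cone ($C \subset \mathbb{R}^{n+1}$) of base $B(O,R) \subset \mathbb{R}^n$
and apex Z such that $\|Z - O\|_2 = H$. Then, $\mathrm{cg}(C)$ is
located on the segment [O; Z] at a distance $H/(n+2)$ of O.

Proof: Without loss of generality, we assume that O = 0
and $Z = H v_{n+1}$. By definition, C is a collection of balls, and
we can rewrite $\mathrm{cg}(C)$ as:

\[ \mathrm{cg}(C) = \frac{1}{V(C)} \int_0^H \mathrm{cg}\left(B\left(h v_{n+1}, \frac{H-h}{H} R\right)\right) \times V\left(B\left(h v_{n+1}, \frac{H-h}{H} R\right)\right) dh \]

From Proposition 5 it is clear that $\mathrm{cg}(C)$ lies on the segment
[O; Z]. The remainder of the proof comes from explicitly
calculating $\mathrm{cg}(C)$:

\begin{align*}
\mathrm{cg}(C) &= \frac{1}{V(C)} \int_0^H h v_{n+1} \times V\left(B\left(h v_{n+1}, \frac{H-h}{H} R\right)\right) dh && \text{(Prop. 5)} \\
&= \frac{1}{V(C)} \int_0^H h v_{n+1} \Pi_n \frac{R^n}{H^n} (H-h)^n\,dh && \text{(Volume of a Ball)} \\
&= \frac{1}{V(C)} \int_0^H (H-u) v_{n+1} \Pi_n \frac{R^n}{H^n} u^n\,du && \text{(Subst. } u = H - h\text{, see Prop. 3)} \\
&= \frac{1}{V(C)} \Pi_n \frac{R^n}{H^n} v_{n+1} \left[ H \int_0^H u^n\,du - \int_0^H u^{n+1}\,du \right] \\
&= \frac{1}{V(C)} \left[ \frac{\Pi_n R^n H^{n+2} v_{n+1}}{H^n (n+1)} - \frac{\Pi_n R^n H^{n+2} v_{n+1}}{H^n (n+2)} \right] \\
&= \frac{1}{V(C)} \left[ \frac{\Pi_n R^n H^2 v_{n+1}}{n+1} - \frac{\Pi_n R^n H^2 v_{n+1}}{n+2} \right] \\
&= \frac{n+1}{\Pi_n R^n H} \left[ \frac{\Pi_n R^n H^2 v_{n+1}}{n+1} - \frac{\Pi_n R^n H^2 v_{n+1}}{n+2} \right] && \text{(Volume of a Hyper-cone)} \\
&= \left( 1 - \frac{n+1}{n+2} \right) H v_{n+1} \\
&= \frac{n+2-n-1}{n+2} H v_{n+1} \\
&= \frac{H}{n+2} v_{n+1}
\end{align*}

That is to say, $\mathrm{cg}(C)$ is on the segment [O; Z] at a distance
$H/(n+2)$ of O.

D. Hyperplane and Halfspace

Definition 11. We call (n)-Hyperplane of normal $w \in \mathbb{R}^n$ and
offset $b \in \mathbb{R}$, and write $W(w,b) \subset \mathbb{R}^n$, the subset:

\[ W(w,b) = \{x \in \mathbb{R}^n : \langle w, x \rangle + b = 0\} \]

Definition 12. We call Positive Halfspace of the n-Hyperplane
$W(w,b) \subset \mathbb{R}^n$, and write $W^+(w,b) \subset \mathbb{R}^n$, the subset

\[ W^+(w,b) = \{x \in \mathbb{R}^n : \langle w, x \rangle + b \ge 0\} \]

Conversely, we call Negative Halfspace of $W(w,b) \subset \mathbb{R}^n$, and
write $W^-(w,b) \subset \mathbb{R}^n$, the subset

\[ W^-(w,b) = \{x \in \mathbb{R}^n : \langle w, x \rangle + b < 0\} \]

Additionally, note that $W(w,b) \subset W^+(w,b)$ but $W(w,b) \not\subset W^-(w,b)$.

^For a more complete definition of the center of gravity (Definition 10),
we should take into account the mass distribution over K. However, in an
effort to keep things simple, we assume a uniformly distributed mass.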
















































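Definition 13. For any subset $X \subset \mathbb{R}^n$ and any Hyperplane
$W \subset \mathbb{R}^n$ we call Positive Partition, and write $X^+ \subset \mathbb{R}^n$, the
subset

\[ X^+ = X \cap W^+ \]

Conversely, we call Negative Partition, and write $X^- \subset \mathbb{R}^n$,
the subset

\[ X^- = X \cap W^- \]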

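Proposition 7 (Volume reduction of Hyper-Cone). For any
(n + 1)-Hyper-cone C of base $B(O,R)$, apex Z and height H,
let $W_{\mathrm{cg}(C)} = W(v_{n+1}, H/(n+2))$ be the Hyperplane passing
through $\mathrm{cg}(C)$ (i.e. $\mathrm{cg}(C) \in W_{\mathrm{cg}(C)}$) and parallel to $\mathbb{R}^n$. Then,

\[ V(C^+) = V(C) \left( \frac{n+1}{n+2} \right)^{n+1} \ge V(C)\,e^{-1} \]

Proof: We start by proving the right-hand side of the
relation. Let $n' = n + 1$ and divide both sides by $V(C)$;
then we can rewrite it as

\[ \left( \frac{n+1}{n+2} \right)^{n+1} = \left( \frac{n'}{n'+1} \right)^{n'} = \frac{1}{\left(1 + \frac{1}{n'}\right)^{n'}} \ge e^{-1} \]

From the usual definition of e we have that

\[ \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n = e \;\Leftrightarrow\; \lim_{n \to \infty} \frac{1}{\left(1 + \frac{1}{n}\right)^n} = e^{-1} \]

And by standard arguments we can show that

\[ \frac{1}{\left(1 + \frac{1}{n}\right)^n} \ge \frac{1}{\left(1 + \frac{1}{n+1}\right)^{n+1}} \]

Therefore

\[ \frac{1}{\left(1 + \frac{1}{n'}\right)^{n'}} \ge \lim_{n \to \infty} \frac{1}{\left(1 + \frac{1}{n}\right)^n} = e^{-1} \]

Finally, the left-hand side of the relation is obtained by direct
calculation of $V(C^+)$. The general idea is the same as for the
calculation of $V(C)$ but, instead of integrating over the entire
height, we start at $H/(n+2)$, thus ignoring $C^-$. Besides, without
loss of generality, we assume that O = 0 and that $Z = H v_{n+1}$.

\begin{align*}
V(C^+) &= \int_{H/(n+2)}^{H} V\left(B\left(h v_{n+1}, \frac{H-h}{H} R\right)\right) dh \\
&= \int_{H/(n+2)}^{H} \Pi_n \frac{R^n}{H^n} (H-h)^n\,dh && \text{(Volume of a Ball)} \\
&= \Pi_n \frac{R^n}{H^n} \int_0^{H - H/(n+2)} u^n\,du && \text{(Subst. } u = H - h\text{)} \\
&= \Pi_n \frac{R^n}{H^n} \int_0^{H \frac{n+1}{n+2}} u^n\,du \\
&= \Pi_n \frac{R^n}{H^n} \times \frac{1}{n+1} \left( H \frac{n+1}{n+2} \right)^{n+1} \\
&= \frac{\Pi_n R^n H}{n+1} \left( \frac{n+1}{n+2} \right)^{n+1} \\
&= V(C) \left( \frac{n+1}{n+2} \right)^{n+1}
\end{align*}

E. Setting

For the remainder of this document, let K be a (full dimensional)
convex body in $\mathbb{R}^{n+1}$.

Definition 14. For any convex body $K \subset \mathbb{R}^{n+1}$ we say that
K is Spherically Symmetric along the unit vector v if and
only if $\forall \lambda \in \mathbb{R}$ the cut of K by the hyperplane $W(v, \lambda)$ (i.e.
$K \cap W(v, \lambda)$) is an n-dimensional hypersphere of center $\lambda v$.

This section consists in a rewriting of the proof of Grünbaum
instantiated within the previously defined notation and setting.

F. Theorem

Theorem 3. For any convex body $K \subset \mathbb{R}^{n+1}$ and any
hyperplane W, if $\mathrm{cg}(K) \in K^+$ then

\[ V(K^+) \ge e^{-1}\,V(K) \]

Proof: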


Note 1 (Points along $v_{n+1}$). This proof will revolve around
key points located on the (n + 1)-th axis of $\mathbb{R}^{n+1}$, of base vector
$v_{n+1}$. In an attempt to avoid overburdening the notation, we
will treat these points as numbers along the real line when
the context is clear. Therefore, if $x = \lambda_1 v_{n+1}$ and $y = \lambda_2 v_{n+1}$
we will freely write x > y if $\lambda_1 > \lambda_2$.

Let W be the hyperplane such that $W = \arg\min_W V(K^+)$ subject
to $\mathrm{cg}(K) \in K^+$. It is easy to see that $\mathrm{cg}(K) \in W$:
if $\mathrm{cg}(K) \notin W$ one can always reduce $V(K^+)$ by shifting
W toward $\mathrm{cg}(K)$. Without loss of generality, let us say that
$\mathrm{cg}(K) = 0$, the origin of $\mathbb{R}^{n+1}$, and that $v_{n+1}$ is the normal
vector of W with b = 0, hence $W = W(v_{n+1}, 0)$.

In order to ease the comprehension of the proof, we make the
following assumption, which we will lift later on.

Assumption 1. K is a convex body which is Spherically
Symmetric along $v_{n+1}$.

A direct implication of this is that $\mathrm{cg}(K) = \mathrm{cg}(K \cap W)$. In
other words, $\mathrm{cg}(K)$ is the center of the (n − 1)-dimensional
sphere $K \cap W$ (see, for example, the arguments used for the
propositions on centers of gravity above).

Let $C^+$ be the Hyper-cone of base $K \cap W$ and apex Z such that
$C^+ \subset W^+$ and $V(C^+) = V(K^+)$.

Moreover either:

• $K^+ = C^+$ and Z is the apex of $K^+$

• $Z \notin K^+$

To prove that, remember that each slice $K \cap W(v_{n+1}, \lambda)$
of K along the (n + 1)-th axis is a sphere. We look at the
function $r_{C^+}(\cdot)$ (resp. $r_{K^+}(\cdot)$) which maps each value of
$\lambda \in \mathbb{R}^+$ to the radius of the corresponding slice of $C^+$
(resp. $K^+$). By construction, we know that $r_{C^+}(0) = r_{K^+}(0)$
and, by definition, $r_{C^+}(\cdot)$ is a decreasing linear function. If
$r_{K^+}(\cdot)$ has any strictly convex part, then there exists an arc
$[r_{K^+}(\lambda_1), r_{K^+}(\lambda_2)]$ which is not in $K^+$ and therefore K is
not a convex set. Therefore $r_{K^+}(\cdot)$ is concave. Then, because
$r_{C^+}(0) = r_{K^+}(0)$, for Z to be in $K^+$ either $K^+$ is a Hyper-cone
(and $r_{K^+}$ is linear) or $V(K^+) > V(C^+)$ (that is to say
$\int r_{K^+}(\lambda)\,d\lambda > \int r_{C^+}(\lambda)\,d\lambda$), which is in contradiction
with the definition of $C^+$.

As a consequence, $C^+$ is at least as elongated as $K^+$. In other
words, the mass of $C^+$ is more spread along the axis of $v_{n+1}$;
this incurs a shift of the center of gravity $\mathrm{cg}(C^+)$ with
respect to $\mathrm{cg}(K^+)$. Therefore $\mathrm{cg}(C^+)$ is on the closed segment
$[\mathrm{cg}(K^+), Z]$.

Thus, by using the notation introduced in Note 1:

\[ 0 = \mathrm{cg}(K) \le \mathrm{cg}(K^+) \le \mathrm{cg}(C^+) \le Z \]

We now define $C^-$ by extending $C^+$ such that $C = C^- \cup C^+$
is a cone of apex Z and $V(C^-) = V(K^-)$. Therefore,

\begin{align*}
V(C) &= V(C^+) + V(C^-) \\
&= V(K^+) + V(K^-) \\
&= V(K)
\end{align*}

Once again, we are interested in the relative position of
$\mathrm{cg}(K^-)$ and $\mathrm{cg}(C^-)$. We invoke the same arguments as
before and claim that, in a similar way:

\[ \mathrm{cg}(K^-) \le \mathrm{cg}(C^-) \le 0 = \mathrm{cg}(K) \]

Remark 2. The proof for this is a little more tricky this time,
though. Part of this is due to the fact that $C^-$ is not a Hyper-cone
in itself and one must consider C and K in their entirety
for the nonconvexity argument. A possible start is to consider
the radius increase along the reverse axis $\bar{v}_{n+1} = -v_{n+1}$
and replicate the previous argument with added attention to
the slope of $r_{K^-}(\cdot)$, which must be such that $r_K(\cdot)$ as a whole
is still concave.

Let $\alpha, \beta \in \mathbb{R}$ be such that $\alpha = V(K^+)/V(K)$ and
$\beta = V(K^-)/V(K)$. Then

\[ \mathrm{cg}(K) = \alpha\,\mathrm{cg}(K^+) + \beta\,\mathrm{cg}(K^-) \]

Or alternatively, by construction of C,

\[ \mathrm{cg}(C) = \alpha\,\mathrm{cg}(C^+) + \beta\,\mathrm{cg}(C^-) \]

Combining these with the previous inequalities, we have that

\[ \mathrm{cg}(K) \le \mathrm{cg}(C) \]

Moreover, we know from Proposition 6 that $\mathrm{cg}(C)$ is at a
distance $H/(n+2)$ of its base, where H is the height of C.

Let $\widetilde{W} = W(v_{n+1}, b)$ be such that $\mathrm{cg}(C) \in \widetilde{W}$ and write $\widetilde{C}^+$ the
positive partition of C by $\widetilde{W}$, that is $\widetilde{C}^+ = \widetilde{W}^+ \cap C$.

From Proposition 7 we have that $V(\widetilde{C}^+) \ge e^{-1} V(C)$.
Moreover, because $\widetilde{W} \ni \mathrm{cg}(C) \ge \mathrm{cg}(K)$ we have that
$V(C^+) \ge V(\widetilde{C}^+)$.

Putting all of this together we get that

\begin{align*}
V(K^+) &= V(C^+) \\
&\ge V(\widetilde{C}^+) \\
&\ge e^{-1} V(C) \\
&= e^{-1} V(K)
\end{align*}

Finally, all we have left is to deal with Assumption 1. This
is simply tackled by remarking that, by definition, spherical
symmetrization preserves volumes along its axis. Thus, for any
K of any convex shape it suffices to apply the proof to the
spherical symmetrization of K, $\mathrm{sym}_S(K)$, and we have

\[ V(K^+) = V(\mathrm{sym}_S(K^+)) \ge e^{-1} V(\mathrm{sym}_S(K)) = e^{-1} V(K) \]
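The bound of Theorem 3 can be illustrated numerically on a convex body that is not a cone. The sketch below is an assumption-laden example rather than part of the proof: it uses a random simplex in $\mathbb{R}^3$ (whose center of gravity is exactly the mean of its vertices), samples it uniformly with Dirichlet weights, and measures the volume fraction on the positive side of random hyperplanes through the center of gravity. The helper name `halfspace_fractions` is hypothetical.

```python
import numpy as np

def halfspace_fractions(points, center, normals):
    """For each normal w, fraction of points x with <w, x - center> >= 0."""
    return ((points - center) @ normals.T >= 0.0).mean(axis=0)

rng = np.random.default_rng(0)
d = 3                                           # ambient dimension (n + 1 in the text)
vertices = rng.standard_normal((d + 1, d))      # a random simplex
weights = rng.dirichlet(np.ones(d + 1), size=50_000)
points = weights @ vertices                     # uniform samples in the simplex
center = vertices.mean(axis=0)                  # exact center of gravity of a simplex

normals = rng.standard_normal((100, d))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
fractions = halfspace_fractions(points, center, normals)
min_fraction = fractions.min()                  # to be compared with exp(-1)
```

Up to sampling noise, every entry of `fractions` should stay above $e^{-1} \approx 0.368$; because the opposite normal $-w$ obeys the same bound, no entry should exceed $1 - e^{-1}$ either.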


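This section is dedicated to the main theorem, which is a
generalization of Theorem 3 to an approximate center of mass.

Theorem 4. For any convex body $K \subset \mathbb{R}^{n+1}$ and any Hyperplane
W of normal v, splitting K into $K^+$ and $K^-$. Let

\[ X = \mathrm{cg}(K) + \lambda \frac{(n+1) V(K)}{\Pi_n R_{K^+}^n} \left( \frac{H_{K^+}}{(n+2) H_{K^-}} \right)^n \left[ 1 - \frac{1}{n+2} \right] v \]

where $H_{K^+} = \max_{a \in K^+} a^\top v$, $H_{K^-} = \max_{a \in K^-} -a^\top v$ and
$R_{K^+}$ is the radius of the (n − 1)-Ball $B_{K \cap W}$ such that
$V(B_{K \cap W}) = V(K \cap W)$.

Then, if $X \in K^+$ the following holds true

\[ V(K^+) \ge V(K)\,(1 - \lambda)^{n+1} e^{-1} \]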


Proof:

The proof starts in a similar way to that of Grünbaum,
with respect to X.

Namely, let Assumption 1 hold for now, and let W be the
hyperplane such that $W = \arg\min_W V(K^+)$ subject to $X \in K^+$.
Same as before, we have that $X \in W$. Without loss of
generality, let us say that X = 0, the origin of $\mathbb{R}^{n+1}$, and
that $v_{n+1}$ is the normal vector of W with b = 0, hence
$W = W(v_{n+1}, 0)$.

Let us define $C^+$ the Hyper-cone of base $K \cap W$ and apex Z such
that $C^+ \subset W^+$ and $V(C^+) = V(K^+)$. Moreover, let $C^-$ be
the extension of $C^+$ such that $C = C^- \cup C^+$ is a Hyper-cone
of height H and volume $V(C) = V(K)$. That is to
say $V(C^-) = V(K^-)$. From the same argument as before,
we know that $\mathrm{cg}(C^+)$ (resp. $\mathrm{cg}(C^-)$) is shifted with respect
to $\mathrm{cg}(K^+)$ (resp. $\mathrm{cg}(K^-)$); thus, according to the notation
defined in Note 1, we have that

\[ \mathrm{cg}(K) \le \mathrm{cg}(C) \]

If $X \le \mathrm{cg}(C)$ then the exact same argument as the one of
Section F applies and

\[ V(K^+) \ge V(K)\,e^{-1} \quad \text{(see Th. 3 for details)} \]

Otherwise, we have that

\[ \mathrm{cg}(K) \le \mathrm{cg}(C) < X \]

The idea of the proof is to find $\widetilde{X}$ such that $X \le \widetilde{X}$, from
which we can bound the volume of $K^+$ in a similar way as
before.


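Let us define

\[ \widetilde{X} = \mathrm{cg}(C) + \lambda H \frac{n+1}{n+2} v_{n+1} \qquad (10) \]

Denote by $B_0 = B(B_0, R)$ the base of C (with a slight abuse of
notation, $B_0$ also denotes its center), and recall that C
has height H and apex Z and that $\mathrm{cg}(C) = B_0 + \frac{H}{n+2} v_{n+1}$.
Therefore, measuring heights from $B_0$,

\[ \widetilde{X} = H \left[ \frac{1}{n+2} + \lambda \frac{n+1}{n+2} \right] v_{n+1} \]

Consider $\widetilde{W} = W(v_{n+1}, b)$ the Hyperplane of normal vector
$v_{n+1}$ (i.e. $\widetilde{W}$ is parallel to W) and offset b such that $\widetilde{X} \in \widetilde{W}$.
Then, let $\widetilde{C}^+$ be the positive partition of C with respect to $\widetilde{W}$:

\[ \widetilde{C}^+ = C \cap \widetilde{W}^+ \]

We can compute the volume of $\widetilde{C}^+$ as follows:

\begin{align*}
V(\widetilde{C}^+) &= \int_{\widetilde{X}}^{H} V\left(B\left(B_0 + h v_{n+1}, \frac{H-h}{H} R\right)\right) dh \\
&= \int_{\widetilde{X}}^{H} \Pi_n \frac{R^n}{H^n} (H-h)^n\,dh && \text{(Volume of a Ball)} \\
&= \int_0^{H - \widetilde{X}} \Pi_n \frac{R^n}{H^n} u^n\,du && \text{(Subst. } u = H - h\text{)} \\
&= \Pi_n \frac{R^n}{H^n} \int_0^{H (1-\lambda) \left[1 - \frac{1}{n+2}\right]} u^n\,du \\
&= \frac{\Pi_n R^n H}{n+1} (1-\lambda)^{n+1} \left[ 1 - \frac{1}{n+2} \right]^{n+1} \\
&= V(C)\,(1-\lambda)^{n+1} \left[ 1 - \frac{1}{n+2} \right]^{n+1} && \text{(Volume of a Hyper-cone)} \\
&\ge V(C)\,(1-\lambda)^{n+1} e^{-1} && \text{(see Prop. 7)}
\end{align*}

where, in the first two lines, we allow a slight abuse of notation
and use $\widetilde{X}$ as a real, as explained in Note 1.

Then, we can rewrite the volume of $C^+$ as

\[ V(C^+) = \int_{X}^{\widetilde{X}} V\left(B\left(B_0 + h v_{n+1}, \frac{H-h}{H} R\right)\right) dh + V(\widetilde{C}^+) \]

Consequently, if $X \le \widetilde{X}$ then the first term of $V(C^+)$ is
positive and therefore $V(C^+) \ge V(\widetilde{C}^+)$.

It remains to relate $\widetilde{X}$ to X:

\begin{align*}
\widetilde{X} &= \mathrm{cg}(C) + \lambda H \left[ 1 - \frac{1}{n+2} \right] \\
&= \mathrm{cg}(C) + \lambda \frac{(n+1) V(C)}{\Pi_n R^n} \left[ 1 - \frac{1}{n+2} \right] && \text{(Volume of a Hyper-cone)} \\
&\ge \mathrm{cg}(K) + \lambda \frac{(n+1) V(K)}{\Pi_n R^n} \left[ 1 - \frac{1}{n+2} \right] && (\mathrm{cg}(C) \ge \mathrm{cg}(K) \text{ and } V(C) = V(K))
\end{align*}

Unfortunately, we cannot easily compute R directly. Nonetheless,
since $B_0$ and W are parallel, we can use the triangle
proportionality theorem. Denote by $R_{C^+}$ the radius of the base
of $C^+$, that is $K \cap W$, and by $H_{C^+}$ the height of $C^+$ (i.e. the
distance between X and Z); then we have:

\[ \frac{1}{R} = \frac{H_{C^+}}{H R_{C^+}} \]

From this, we want to bound $H_{C^+}$ and H, since $R_{C^+}$ is easy
enough to estimate because it is directly related to K and W.

From the previous argument, we know that $Z \notin K^+$ except if
$K^+ = C^+$. So let us define

\[ H_{K^+} = \max_{a \in K^+} a^\top v_{n+1} \]

Intuitively, $H_{K^+}$ is the maximal distance between a point in $K^+$
and X with respect to the axis of $v_{n+1}$. Because Z is precisely
on this axis, and $Z \notin K^+$ (or $K^+ = C^+$), the following
holds true

\[ H_{K^+} \le H_{C^+} \qquad (11) \]

Conversely, let us define $H_{K^-}$ as the maximal distance between X
and any point of $K^-$ with respect to the axis of $v_{n+1}$. Again,
from the previous argument, we know that $B_0 \in K^-$. Note that,
because C is a Hyper-cone, we know that $B_0$ is at a distance
$H/(n+2)$ of $\mathrm{cg}(C)$. Moreover, recall that we are treating the
case where $X > \mathrm{cg}(C)$; hence, the distance between $B_0$ and
X is at least $H/(n+2)$, which in turn is smaller than $H_{K^-}$.
Reordering gives the following

\[ (n+2) H_{K^-} \ge H \qquad (12) \]

Putting equations (11) and (12) together, we have that

\[ \frac{1}{R} \ge \frac{H_{K^+}}{(n+2) H_{K^-} R_{C^+}} \]

which we plug back into the previous calculation

\begin{align*}
\widetilde{X} &\ge \mathrm{cg}(K) + \lambda \frac{(n+1) V(K)}{\Pi_n R^n} \left[ 1 - \frac{1}{n+2} \right] \\
&\ge \mathrm{cg}(K) + \lambda \frac{(n+1) V(K)}{\Pi_n R_{C^+}^n} \left( \frac{H_{K^+}}{(n+2) H_{K^-}} \right)^n \left[ 1 - \frac{1}{n+2} \right] \\
&= X
\end{align*}

where the last equality uses $R_{C^+} = R_{K^+}$, both being the radius of
$K \cap W$. Recall that we drop $v_{n+1}$ in the above since we treat X and
$\widetilde{X}$ as real numbers (see Note 1).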

We conclude this proof by putting everything together. Namely, either

• $X \le \mathrm{cg}(C)$ and

\[ V(K^+) \ge V(K)\,e^{-1} \]

• or $X > \mathrm{cg}(C)$ and we can define $\widetilde{C}^+$ such that

\[ V(K^+) \ge V(\widetilde{C}^+) \ge V(K)\,(1-\lambda)^{n+1} e^{-1} \]

Once again, we lift Assumption 1 as before by noting that
spherical symmetrization preserves volumes. One difference, though,
lies in the fact that computing $R_{C^+}$ is no longer immediate
in the general case. Notwithstanding, it can be easily
approximated within satisfactory precision.

As a final note, we may mention that distinguishing between
these two cases is non-trivial. Hence, without additional
computation, only the worst bound can be guaranteed. ■
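The cone computation at the heart of this proof can be checked numerically. The sketch below is a hedged illustration restricted to the Hyper-cone C itself: it places the cut at the height of $\widetilde{X}$ above the base, integrates the slice volumes above the cut, and compares the resulting ratio $V(\widetilde{C}^+)/V(C)$ with the lower bound $(1-\lambda)^{n+1} e^{-1}$. The function names are hypothetical, and the common factor $\Pi_n R^n / H^n$ is omitted because it cancels in the ratio.

```python
import math

def shifted_cut_height(n: int, H: float, lam: float) -> float:
    """Height above the base of X~ = cg(C) + lam * H * (1 - 1/(n + 2)) * v_{n+1}."""
    return H / (n + 2) + lam * H * (1.0 - 1.0 / (n + 2))

def cone_fraction_above(n: int, H: float, cut: float, steps: int = 20_000) -> float:
    """V(C~+) / V(C): midpoint integration of (H - h)^n over [cut, H], normalized."""
    dh = (H - cut) / steps
    above = sum((H - (cut + (i + 0.5) * dh)) ** n for i in range(steps)) * dh
    total = H ** (n + 1) / (n + 1)
    return above / total

def theorem4_lower_bound(n: int, lam: float) -> float:
    """Right-hand side of the bound, divided by V(K): (1 - lam)^(n+1) * exp(-1)."""
    return (1.0 - lam) ** (n + 1) * math.exp(-1.0)

# Example: a cone in R^3 (n = 2) with the cut shifted by lam = 0.1.
ratio = cone_fraction_above(2, 2.0, shifted_cut_height(2, 2.0, 0.1))
bound = theorem4_lower_bound(2, 0.1)
```

With $\lambda = 0$ the cut passes through $\mathrm{cg}(C)$ and the ratio reduces to the $((n+1)/(n+2))^{n+1}$ of Proposition 7; larger values of $\lambda$ move the cut toward the apex and shrink both the ratio and the bound.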















